Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging

Abstract

The paper presents one way of reconciling data sparseness with the requirement of high-accuracy tagging with fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements, which resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1000 morpho-syntactic description codes (MSDs), which, once some systematic syncretic phenomena were taken into account, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require unrealistically large amounts of hand-annotated or hand-validated training data. Our solution was to design a hidden reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as there are LMs available. The tag differences between these variants are processed by a combiner, which chooses the most likely tags. In the end, the tagged text is subject to a conversion process that maps the tags from the reduced tagset onto the more informative tags from the large tagset. We describe this processing chain and provide a detailed evaluation of the results.

Large tagsets and tiered tagging

The paper discusses experiments and results concerned with tagging highly inflectional languages, based on multiple register-diversified language models (LMs). The case-study language is Romanian, for whose tagset we adopted the internationally accepted EAGLES guidelines for morpho-syntactic encoding of lexica. The EAGLES-compliant Romanian lexicon was built within the MULTEXT-EAST Copernicus Joint Project, and the description of its almost half a million wordforms used a set of 614 morpho-syntactic description (MSD) codes. A full description of the encoding scheme we used is given in (Erjavec & Monachini, 1995).
Multilingual content analyses of the MULTEXT-EAST lexica and corpora are
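
The processing chain described in the abstract (tag with several LMs over the reduced tagset, combine the variants, then map back to the large tagset) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say how the combiner chooses tags, so a simple majority vote stands in for it, and the function names, the lexicon layout, and the tag strings are hypothetical rather than taken from the paper.

```python
from collections import Counter


def combine(variants):
    """Merge per-LM tag sequences position by position.

    ASSUMPTION: majority vote, ties resolved in favour of the earliest
    variant. The paper's actual combiner may work differently.
    """
    combined = []
    for tags in zip(*variants):
        counts = Counter(tags)
        # max() returns the first maximal element, so ties go to the
        # tag proposed by the earliest language model.
        combined.append(max(tags, key=lambda t: counts[t]))
    return combined


def recover(words, reduced_tags, lexicon):
    """Map each (wordform, reduced tag) pair to its candidate full MSDs.

    ASSUMPTION: `lexicon` is a dict keyed by (lowercased wordform,
    reduced tag); an empty list means the pair is not in the lexicon.
    """
    return [lexicon.get((w.lower(), t), []) for w, t in zip(words, reduced_tags)]


# Toy data: all tags and lexicon entries below are invented placeholders.
words = ["Casa", "este", "mare"]
variants = [
    ["N1", "V1", "A1"],  # output of a hypothetical LM 1
    ["N1", "V2", "A1"],  # output of a hypothetical LM 2
    ["N2", "V1", "A1"],  # output of a hypothetical LM 3
]
toy_lexicon = {
    ("casa", "N1"): ["MSD-noun-a"],
    ("este", "V1"): ["MSD-verb-a"],
    ("mare", "A1"): ["MSD-adj-a", "MSD-adj-b"],
}

reduced = combine(variants)
full = recover(words, reduced, toy_lexicon)
```

In this toy run the last word keeps two candidate MSDs after recovery, which shows that mapping back to the large tagset need not be one-to-one; how the remaining ambiguity is resolved is not covered by this sketch.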


Similar resources


Tiered Tagging and Combined Language Models Classifiers

We address the problem of morpho-syntactic disambiguation of arbitrary texts in a highly inflectional natural language. We use a large tagset (615 tags), EAGLES and MULTEXT compliant [5]. The large tagset is internally mapped onto a reduced one (82 tags), serving statistical disambiguation, and a text disambiguated in terms of this tagset is subsequently subject to a recovery process of all the i...


Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, called Tiered Tagging (Tufiş, 1999), exploits a reduced set of tags derived by removing severa...


Developing Tools and Building Linguistic Resources for Vietnamese Morpho-syntactic Processing

Vietnamese is spoken by about 80 million people around the world, yet very few concrete works on this language have been noticed in Natural Language Processing (NLP) until now. The fundamental problems in automatic analysis of Vietnamese, such as part-of-speech (POS) tagging, parsing, etc. are extremely difficult due to the lack of formal linguistic knowledge on one hand, and the specificities ...


Morpho-syntactic ambiguity and tagset design for Hungarian

The paper reports on work in progress to develop a tag set for Hungarian. The rich morphological structure of the language makes tagging feasible only after a full-scale morphological analysis, which results in a magnitude of patterns that do not easily translate into a corpus tag set of manageable size. The paper analyses the extent and types of morpho-syntactic ambiguity found in a 21m word s...



Publication date: 2000